09. Importance Sampling
1. Policy Update in REINFORCE
Let’s go back to the REINFORCE algorithm. We start with a policy, \pi_\theta, then use that policy to generate a trajectory (s_t, a_t, r_t) (or multiple trajectories, to reduce noise). Afterward, we compute the policy gradient, g, and update \theta' \leftarrow \theta + \alpha g.
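A minimal, self-contained sketch of this loop is below. The two-armed bandit environment and tabular softmax policy are invented purely for illustration; they are not part of the lesson.

```python
import numpy as np

# Toy setup (not from the lesson): a 2-armed bandit where action 1 pays more
# on average, and a softmax policy over two logits theta.

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def sample_trajectory(theta, rng):
    """Sample a one-step trajectory (s_t, a_t, r_t) with the current policy."""
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(loc=(0.0 if a == 0 else 1.0))  # action 1 has higher mean reward
    return [(0, a, r)]  # a single dummy state 0

def policy_gradient(theta, trajectories):
    """REINFORCE estimate: average of grad log pi(a_t|s_t) * R(tau) over trajectories."""
    probs = softmax(theta)
    g = np.zeros_like(theta)
    for traj in trajectories:
        R = sum(r for _, _, r in traj)
        for _, a, _ in traj:
            grad_log = -probs.copy()
            grad_log[a] += 1.0          # gradient of log-softmax w.r.t. the logits
            g += grad_log * R
    return g / len(trajectories)

rng = np.random.default_rng(0)
theta, alpha = np.zeros(2), 0.1
for _ in range(200):
    # Generate trajectories with the *current* policy pi_theta ...
    trajectories = [sample_trajectory(theta, rng) for _ in range(10)]
    g = policy_gradient(theta, trajectories)
    theta = theta + alpha * g           # theta' <- theta + alpha * g
    # ... and then the trajectories are simply thrown away before the next update.

print(softmax(theta))  # probability mass typically shifts toward action 1
```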
At this point, the trajectories we’ve just generated are simply thrown away. If we want to update our policy again, we would need to generate new trajectories once more, using the updated policy.
You might ask, why is all this necessary? It’s because we need to compute the gradient for the current policy, and to do that the trajectories need to be representative of the current policy.
But this sounds a little wasteful. What if, instead of just throwing the old trajectories away, we could somehow recycle them, by modifying them so that they become representative of the new policy?
Then we could reuse the recycled trajectories to compute gradients and update our policy, again and again. This would make updating the policy a lot more efficient.
So, how exactly would that work?
2. Importance Sampling
This is where importance sampling comes in. Consider the trajectories we generated using the old policy, \pi_\theta. Each of those trajectories had a probability P(\tau;\theta) of being sampled.
Just by chance, the same trajectory could also be sampled under the new policy, only with a different probability, P(\tau;\theta').
Imagine we want to compute the average of some quantity, say f(\tau). We could simply generate trajectories from the new policy, compute f(\tau) for each, and average them.
Mathematically, this is equivalent to adding up all the f(\tau), each weighted by the probability of sampling that trajectory under the new policy.
Now we could modify this sum by multiplying and dividing by the same number, P(\tau;\theta), and rearranging the terms.
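In symbols, using the same notation as above, the manipulation looks like this:

\sum_\tau P(\tau;\theta')\, f(\tau) \;=\; \sum_\tau P(\tau;\theta)\,\frac{P(\tau;\theta')}{P(\tau;\theta)}\, f(\tau)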
It doesn’t look like we’ve done much. But written in this way, we can reinterpret the first factor, P(\tau;\theta), as the probability of sampling each trajectory under the old policy, with the ratio P(\tau;\theta')/P(\tau;\theta) acting as an extra re-weighting factor on top of the usual averaging.
Intuitively, this tells us we can use old trajectories to compute averages for the new policy, as long as we include this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one.
The same trick is used frequently across statistics, where re-weighting factors are included to un-bias surveys and voting predictions.
3. The Re-weighting Factor
Now let’s take a closer look at the re-weighting factor.
Because each trajectory contains many steps, each of these probabilities is a chain of products of the policy evaluated at every time-step.
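Concretely, assuming the usual decomposition of P(\tau;\theta) into the per-step policy probabilities times the environment’s transition probabilities (the transition terms are identical in the numerator and denominator, so they cancel), the re-weighting factor becomes:

\frac{P(\tau;\theta')}{P(\tau;\theta)} \;=\; \frac{\pi_{\theta'}(a_1|s_1)\,\pi_{\theta'}(a_2|s_2)\,\pi_{\theta'}(a_3|s_3)\cdots}{\pi_{\theta}(a_1|s_1)\,\pi_{\theta}(a_2|s_2)\,\pi_{\theta}(a_3|s_3)\cdots}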
This formula is a bit complicated, but there is a bigger problem. When some of the policy probabilities get close to zero, the re-weighting factor can become close to zero, or worse, close to 1 over 0, which diverges to infinity.
When this happens, the re-weighting trick becomes unreliable. So, in practice, we want to make sure the re-weighting factor is not too far from 1 when we use importance sampling.
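To make this concrete, here is a small numerical sketch (invented for illustration, not from the lesson) that multiplies up the per-step probability ratios for a single trajectory, assuming we already have the per-step action probabilities under the old and new policies:

```python
import numpy as np

def reweighting_factor(new_probs, old_probs):
    """Importance-sampling ratio P(tau; theta') / P(tau; theta) for one trajectory.

    new_probs[t] = pi_{theta'}(a_t | s_t), old_probs[t] = pi_theta(a_t | s_t).
    Computed in log-space, a common trick for numerical stability.
    """
    log_ratio = np.sum(np.log(new_probs) - np.log(old_probs))
    return np.exp(log_ratio)

# When the two policies are close, the factor stays near 1.
old = np.array([0.50, 0.40, 0.60, 0.55])
new = np.array([0.52, 0.38, 0.61, 0.54])
print(reweighting_factor(new, old))       # roughly 0.99

# If even one old-policy probability is close to zero, the factor explodes.
old_bad = np.array([0.50, 0.40, 0.001, 0.55])
print(reweighting_factor(new, old_bad))   # roughly 590 -- far from 1, so unreliable
```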